In [1]:
from __future__ import division, print_function, unicode_literals
%matplotlib inline
import os
import IPython.display
import numpy as np
import matplotlib.pyplot as plt
import data_io
import json_io
import utilities
import requests
import BeautifulSoup
import arrow
import twython
import textblob
# Keyboard shortcuts: http://ipython.org/ipython-doc/stable/interactive/notebook.html#keyboard-shortcuts
My objective for this effort is to demonstrate my data processing and analysis capabilities outside of my traditional hyperspectral remote sensing work. I have played in that arena for a long time and picked up a good number of modeling and analysis skills. This effort is meant to be a quick example of processing unfamiliar data using new tools and protocols. The work needs to be quick, efficient, and have a clear punch line. This notebook is where I plan to explore these tools and the data they help me fetch. I'll use another notebook for the analysis once all the data is sorted out.
I made a statement a few days ago indicating that I would like to solve new types of problems. For example, I might treat a new movie as a collection of word feature vectors pulled out of a Twitter feed. I would then make statistical associations with other movies having known performance characteristics, such as viewer retention and engagement. The validity of the association process could be verified by testing with labelled data. Results from such a process might be useful for someone's planning efforts.
Very early this morning I could not sleep as I kept thinking about this little task, reviewing in my head the details of the objective in the section above. I need a way to make sense of text describing a movie. A quick search on GitHub turned up this great Python package: TextBlob. The text from the website says:
A library for processing textual data. It provides a simple API for diving into common natural
language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction,
sentiment analysis, classification, translation, and more.
Behind the scenes it uses the packages NLTK and pattern. I haven't done much at all with natural language text processing, but this tool looks like a great place to start! TextBlob will return two metrics describing the sentiment of a chunk of text: polarity and subjectivity. Those two numbers will be a great starting point for visualizing this stuff.
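Just to convince myself I understand what comes back, here is a minimal sanity check of the TextBlob sentiment interface (the example sentence is made up, and the exact numbers will depend on TextBlob's bundled models):

from textblob import TextBlob

# Sentiment is a namedtuple of (polarity, subjectivity).
# Polarity runs from -1 (negative) to +1 (positive); subjectivity from 0 (objective) to 1 (subjective).
blob = TextBlob('The sequel was a surprisingly fun ride, even if the plot felt thin.')
print(blob.sentiment)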
Next, I found a nice web API for querying information from both IMDb and RottenTomatoes: The OMDb API. This site used to reside at this other address http://imdbapi.com/, but not anymore. Take a look over there for an interesting writeup of the site owner's interaction with IMDb.com's lawyers. Very clever! Anyhow, you can use this service to very easily search for movie information pulled from IMDb and RottenTomatoes.
I found several Python packages on GitHub wrapping Twitter's API service; here are the two that seem best maintained: https://github.com/geduldig/TwitterAPI and https://github.com/ryanmcgrath/twython. Just from reading over each package, I really like Twython's minimalistic interface. I will probably go with that.
Sign up for the Twitter developer API at https://dev.twitter.com/apps. I named my app MovieInfoPierreDemo. That name is so goofy, but I felt rushed! Once it's set up I need to grab the Consumer Key and the Consumer Secret. It's a bit confusing sorting through all the authentication options, but the Twython documentation finally had great advice if all one cares about is read access to Twitter: use OAuth 2!
Next I wanted a list of interesting movies to play with. I found this list of 2012 sequels to popular movies: Sequel Movies 2012. I figure I'll need to manipulate some of that data by hand just to get it done quickly. I would normally write some code to automate this step, but right now this is a one-time deal.
Once the data is all assembled into a usable form, my plan is to compare words using the Bag-of-Words approach. This involves computing histograms of word frequencies for some ensemble of words (e.g. words collected from Tweets). There are several ways to compare histograms with the goal of computing a similarity metric. My favorite is the Earth Mover's Distance (EMD). It's like this: given two different histograms with the same bins, think of the two distributions as two piles of dirt. The EMD metric is then the minimum amount of work a bulldozer would have to do in order to make one pile of counts look like the other. Last year I wrapped up this Fast C++ EMD implementation as a Python extension for a work project.
In that project the distance between bins was simple: just the Euclidean distance between them in bin space. But in this task I am dealing with words as labels for each bin. There is no physical meaning associated with which word is represented in the adjacent bin. The words might be sorted alphabetically, or by size, or just at random.
About ten years ago I implemented the Levenshtein Distance in IDL, way before I ever started using Python. I could have translated that older IDL version over to Python, but it was actually easier to just go Googling for a Python implementation. One of the first results that came back was py-editdist. That's what I love about Python: if you need a new function, there's a good chance somebody has already implemented something similar and made their repo publicly available.
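To make the idea concrete before I commit to anything, here is a rough sketch of how word histograms and a word-to-word ground distance could fit together (the helper names are hypothetical, and I'm using a plain-Python edit distance here instead of py-editdist or my C++ EMD wrapper):

from collections import Counter

def word_histogram(texts):
    """Bag-of-Words: histogram of word frequencies over a collection of text snippets."""
    words = [w.lower() for text in texts for w in text.split()]
    return Counter(words)

def edit_distance(a, b):
    """Plain-Python Levenshtein distance, standing in for py-editdist here."""
    prev = range(len(b) + 1)
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# The EMD ground distance between two histogram bins would then be the edit distance
# between the bin labels (words), rather than a Euclidean distance in bin space.
hist_a = word_histogram(['the hobbit was great', 'loved the hobbit'])
hist_b = word_histogram(['the sequel was terrible'])
print(hist_a.most_common(3))
print(hist_b.most_common(3))
print(edit_distance('hobbit', 'hobbits'))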
Given all that brainstorming above, let's make a plan of action!
I am going to focus on acquiring data from various sources and aggregate it into a form suitable for visualization and analysis. I don't think I'll have enough time for any exhaustive analysis. The most I want to get done then is generating a nice visualization.
I made a short list of recent movies and stored the name and year of release in a simple YAML text file.
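The contents look something like this (hypothetical entries shown here; the real movies.yml is what gets read in the next cell, and my data_io module also hands back a metadata record that I'm glossing over):

- name: Skyfall
  year: 2012
- name: "The Hunger Games: Catching Fire"
  year: 2013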
In [2]:
fname_movies = 'movies.yml'
# Run this set of lines to view the text contents of my movie file, or just open the file in your text editor.
# with open(fname_movies) as fi:
# for v in fi.readlines():
# print(v.rstrip())
info_movie_list, meta = data_io.read(fname_movies)
for item in info_movie_list:
print('{0} ({1})'.format(item['name'], item['year']))
As mentioned above, the OMDb API is a nice web service for querying movie information pulled from both IMDb and RottenTomatoes.
Let's use this service to search for movie information, using the Requests package to fetch data in two steps. The first is to determine the IMDb ID number for each movie in the list. With that information we can then pull down additional details from RottenTomatoes.
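As a quick sketch of that two-step idea (the OMDb query parameters t, y, i, and tomatoes are the ones I'm assuming here; the cell further below ends up collapsing this into a single title query):

import requests

omdbapi_url = 'http://www.omdbapi.com'

# Step 1: look the movie up by title and year to recover its IMDb ID.
r1 = requests.get(omdbapi_url, params={'t': 'Skyfall', 'y': 2012})
imdb_id = r1.json()['imdbID']

# Step 2: query by IMDb ID, asking for the extra RottenTomatoes fields.
r2 = requests.get(omdbapi_url, params={'i': imdb_id, 'tomatoes': True})
print(r2.json()['tomatoConsensus'])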
Note: I just found out that RottenTomatoes has their own API. I just now signed up for an API key and have read quickly through the documentation. It looks simple and easy-to-use. If there's time I'll probably switch over to that instead of OMDb.
In [3]:
omdbapi_url = 'http://www.omdbapi.com'
movie_name = info_movie_list[1]['name'] # index #1 should yield Hunger Games for 2013.
movie_year = info_movie_list[1]['year']
params = {'t': movie_name, 'y': movie_year, 'tomatoes': True}
response = requests.get(omdbapi_url, params=params)
info_omdb = response.json()
# Use IPython's builtin display function for nicely-formatted view of the response data.
IPython.display.display(info_omdb)
Notice that some of the entries have characters encoded as HTML (ampersand) entities. I would like to decode all such occurrences back to regular text, and I found a nice implementation of just such a function over at stackoverflow.com. I am calling it from my utilities.py module. The next cell fixes all occurrences of these encodings.
In [4]:
# Undo any ampersand encoding in the text returned from OMDbAPI.com.
for k in info_omdb.keys():
info_omdb[k] = utilities.decode(info_omdb[k])
So! There's a lot of stuff in that response data, but for now I'm going to focus on just a few pieces: the text strings corresponding to the keys Plot and tomatoConsensus, the viewer ratings from IMDb and RottenTomatoes, plus the date the movie was released to theaters. Just to keep things simple here, I am going to copy out only the handful of fields I care about, and take care of date parsing at the same time.
As far as dates go, I really, really dislike using Python's builtin date and time tools. The good news is there now exists a much better choice for working with dates: Arrow. From the web site:
Arrow is a Python library that offers a sensible, human-friendly approach to creating, manipulating, formatting and converting dates, times, and timestamps. [...] Arrow is heavily inspired by moment.js and python-requests.
See here http://crsmithdev.com/arrow/#format and here http://crsmithdev.com/arrow/#tokens for date/time format details.
In [5]:
format = 'DD MMM YYYY'
date_released = arrow.get(info_omdb['Released'], format)
# print('date_released: ', date_released.year, date_released.month, date_released.day)
info_movie = {'Title': info_omdb['Title'],
'Plot': info_omdb['Plot'],
'tomatoConsensus': info_omdb['tomatoConsensus'],
'Released': date_released,
'imdbRating': float(info_omdb['imdbRating']),
'tomatoRating': float(info_omdb['tomatoRating'])}
IPython.display.display(info_movie)
In [6]:
uri_base = 'http://api.rottentomatoes.com/api/public/v1.0'
uri_home = 'http://api.rottentomatoes.com/api/public/v1.0.json?apikey=mfy52ff3xbgcdwxqr9fwvjw9'
Now, on to Twitter. I am following the very helpful instructions from the Twython documentation, and using my own general-purpose data_io module to store the credentials. This part is so much easier if all you need is access to the search API, and not access to any personal info.
In [7]:
# Set this flag to True when you need to generate a new Twitter API access token.
flag_get_new_token = False
fname_twitter_api = 'twitter_api.yml'
# Load Twitter API details.
info_twitter_api, meta = data_io.read(fname_twitter_api)
if 'access_token' not in info_twitter_api:
flag_get_new_token = True
if flag_get_new_token:
# Use my Twitter dev API credentials to fetch a new access token.
twitter = twython.Twython(info_twitter_api['consumer_key'], info_twitter_api['consumer_secret'], oauth_version=2)
print('Fetching new token...')
access_token = twitter.obtain_access_token()
# Store the token for later use.
info_twitter_api['access_token'] = access_token
data_io.write(fname_twitter_api, info_twitter_api)
print('New token stored: {:s}'.format(fname_twitter_api))
else:
twitter = twython.Twython(info_twitter_api['consumer_key'], access_token=info_twitter_api['access_token'])
# This little try/except section is the only way I know (so far) to determine if I have a valid Twitter access token
# when using OAuth 2.
try:
temp = twitter.get_application_rate_limit_status()
except twython.TwythonAuthError:
msg = 'Boo hoo, you may need to regenerate your access token.'
raise twython.TwythonError(msg)
Just for the fun of it, let's print out some interesting tidbits about the current status of my Twitter API key. This includes some version numbers and the current status of various rate limits. If you are going to be calling the API frequently, you might run up against these rate limits. Watch out!
In [8]:
print('\nApp key: {}'.format(twitter.app_key))
print('OAuth version: {}'.format(twitter.oauth_version))
print('API version: {}'.format(twitter.api_version))
print('Authenticate URL: {}'.format(twitter.authenticate_url))
# Rate limit stats.
info_rate = twitter.get_application_rate_limit_status()
# Application limits.
n_limit = info_rate['resources']['application']['/application/rate_limit_status']['limit']
n_remaining = info_rate['resources']['application']['/application/rate_limit_status']['remaining']
t_reset = info_rate['resources']['application']['/application/rate_limit_status']['reset']
delta = arrow.get(t_reset) - arrow.now()
t_wait = delta.seconds/60
print()
print('Application limit: {} requests'.format(n_limit))
print('Application remaining: {} requests'.format(n_remaining))
print('Application wait time: {:.1f} min.'.format(t_wait))
# Search limits.
n_limit = info_rate['resources']['search']['/search/tweets']['limit']
n_remaining = info_rate['resources']['search']['/search/tweets']['remaining']
t_reset = info_rate['resources']['search']['/search/tweets']['reset']
delta = arrow.get(t_reset) - arrow.now()
t_wait = delta.seconds/60
print()
print('Search limit: {} requests'.format(n_limit))
print('Search remaining: {} requests'.format(n_remaining))
print('Search wait time: {:.1f} min.'.format(t_wait))
Follow the search instructions for OAuth 2 at the Twython site here. The pieces of Twitter's documentation that I found most useful are called out below.
I also found a nice IPython notebook online showing an example using Twython, but unfortunately it was written for the older version 1.0 of the Twitter API. The implementation details have changed with Twitter's API version 1.1.
The page Help with the Search API has this helpful tidbit of information for when you expect a large number of returned tweets; in that case it is important to pay attention to how you iterate through the results:
Iterating in a result set: parameters such count, until, since_id, max_id allow to control how we iterate through search results, since it could be a large set of tweets. The 'Working with Timelines' documentation is a very rich and illustrative tutorial to learn how to use these parameters to achieve the best efficiency and reliability when processing result sets.
Ok, now I've read through the Twython package documentation and the source code. The authors of this package have fully taken into account the advice from Twitter quoted above. The way forward here is to use the instance method Twython.cursor. My wrapper is now a lot simpler than what I had earlier this afternoon! Woo!
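The bare cursor idiom looks roughly like this (assuming the authenticated twitter instance created above; my wrapper further below just adds some filtering on top of it):

# Iterate over individual tweets; Twython's cursor handles paging through results behind the scenes.
for tweet in twitter.cursor(twitter.search, q='hobbit', count=100, lang='en'):
    print(tweet['text'])
    break  # just peek at the first result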
Here is an example of the JSON contents of a tweet returned through the Twitter API.
{'contributors': None,
'coordinates': None,
'created_at': 'Sat Dec 28 17:26:58 +0000 2013',
'entities': {'hashtags': [],
'symbols': [],
'urls': [],
'user_mentions': [{'id': 848116975,
'id_str': '848116975',
'indices': [42, 58],
'name': 'ZZZZZZZ',
'screen_name': 'QQQQQQQ'},
{'id': 2202651295,
'id_str': '2202651295',
'indices': [59, 73],
'name': 'XXXXX',
'screen_name': 'YYYYY'}]},
'favorite_count': 0,
'favorited': False,
'geo': None,
'id': 416983627935670272,
'id_str': '416983627935670272',
'in_reply_to_screen_name': None,
'in_reply_to_status_id': None,
'in_reply_to_status_id_str': None,
'in_reply_to_user_id': None,
'in_reply_to_user_id_str': None,
'lang': 'en',
 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'},
'place': None,
'retweet_count': 0,
'retweeted': False,
'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
'text': 'Going to see the Desolation of Smaug with @AllenWellingto1 @hannahkovacs3',
'truncated': False,
'user': { XXXXX }}
What follows in the next cell are a few helper classes and functions to make future work easier and more fun.
In [15]:
class Tweet(object):
def __init__(self, json):
self.json = json
@property
def has_url(self):
        return bool(self.json['entities']['urls'])
@property
def url_title(self):
"""Return title of first URL page, if URL exists.
"""
if self.has_url:
# Grab the first URL.
url = self.json['entities']['urls'][0]['expanded_url']
resp = requests.get(url)
soup = BeautifulSoup.BeautifulSoup(resp.content)
results = soup.title.string
else:
results = None
return results
@property
def is_retweet(self):
"""Indicate if this tweet is a retweet.
https://dev.twitter.com/docs/platform-objects/tweets
"""
return 'retweeted_status' in self.json
@property
def text(self):
results = self.json['text']
        # Check to see if there are any URLs embedded in the text.
        if self.json['entities']['urls']:
            # Crop the text at the start of the first URL.
ixs = self.json['entities']['urls'][0]['indices']
results = results[:ixs[0]]
return results
@property
def id(self):
"""Twitter tweet ID.
"""
return int(self.json['id_str'])
@property
def timestamp(self):
"""Time when Tweet was created.
        e.g. 'Sat Dec 28 16:56:41 +0000 2013'
"""
format = 'ddd MMM DD HH:mm:ss Z YYYY'
stamp = arrow.get(self.json['created_at'], format)
return stamp
def to_file(self, fname):
"""Serialize this Tweet to a JSON file.
"""
b, e = os.path.splitext(fname)
fname = b + '.json'
json_io.write(fname, self.json)
@staticmethod
def from_file(fname):
"""Instanciate a Tweet object from previously-serialized Tweet.
"""
b, e = os.path.splitext(fname)
fname = b + '.json'
json = json_io.read(fname)
tw = Tweet(json)
return tw
#######################################################################
def search_gen(query, since_id=None, since=None, until=None, lang='en', **kwargs):
"""Generator yielding individual tweets matching supplied query string.
Parameters
----------
query : str, Twitter search query, e.g. "python is nice".
until : date string formatted as 'YYYY-MM-DD'.
"""
gen = twitter.cursor(twitter.search, q=query, since_id=since_id, until=until, lang=lang, **kwargs)
for json in gen:
tw = Tweet(json)
# Check each tweet for crap.
is_crappy = tw.is_retweet or tw.has_url
if not is_crappy:
yield tw
Let's try running a quick query for recent tweets about the current Hobbit movie. Notice below that I am also searching for any URLs in the text. I use a combination of Requests and BeautifulSoup to fetch the title of whatever page is at the other end of that URL. In addition to the text of the actual tweet, I also want the time (UTC), date, and geographic location.
Use id_str instead of id. See this discussion for details: https://groups.google.com/forum/#!topic/twitter-development-talk/ahbvo3VTIYI.
In [16]:
# Practice search on a topic and extracting information from returned tweets.
q = 'Star Wars'
num_max = 15
gen = search_gen(q)
for k, tw in enumerate(gen):
print('\ntweet: {:d}'.format(k))
print('id: {:d}'.format(tw.id))
print(tw.text)
if k > num_max:
break
# Save tweet to JSON file.
# name = 'tweet_{:s}_'.format(tw['id_str'])
In [11]:
tw
Out[11]:
In [12]:
t = Tweet(tw.json)  # tw from the loop above is already a Tweet; re-wrap its raw JSON.
In [13]:
t.is_retweet